Abstract

Various distribution free goodness-of-fit test procedures have been 
extracted from literature. We present two new binning free tests, 
the univariate three-region-test and the multivariate energy test. The 
power of the selected tests with respect to different slowly varying
distortions of experimental distributions is investigated. None of the
tests is optimum for all distortions. The energy test has high power
in many applications and is superior to the $\chi^2$ test.

1 INTRODUCTION 

Goodness-of-fit (gof) tests are designed to measure the compatibility of a 
random sample with a theoretical probability distribution function (pdf). 
The null hypothesis $H_0$ is that the sample follows the pdf. Under the
assumption that $H_0$ applies, the fraction of wrongly rejected experiments,
i.e. the probability of committing an error of the first kind, is fixed to
typically a few percent. A test is considered powerful if the probability of
accepting $H_0$ when $H_0$ is wrong, i.e. the probability of committing an
error of the second kind, is low. Of course, without specifying the
alternatives, the power cannot be quantified.

A discrepancy between a data sample and the theoretical description can 
be of different origin. The problem may be in the theory which is wrong 
or the sample may be biased by measurement errors or by background con- 
tamination. In natural sciences we mainly have the latter situation. Even 
though the statistical description is the same in both cases the choice of the 
specific test may be different. In our applications we are mainly confronted
with "slowly varying" deviations between data and theoretical description 
whereas in other fields where for example time series are investigated, "high 
frequency" distortions are more likely. 

Goodness-of-fit tests are based on classical statistical methods and are
closely related to classical interval estimation, but they also contain
Bayesian elements. Those, however, enter only through some prejudice about
the alternative hypothesis, which affects the purity of the accepted
decisions and not the error of the first kind.

The power of one dimensional tests is not always invariant against trans- 
formations of the variates. In more than one dimension (number of variates), 
an invariant description is not possible. 

Tests are classified into distribution dependent and distribution free tests.
The former are adapted to special pdfs like Gaussian, exponential or uni- 
form distributions. We will restrict our discussion mainly to distribution free 
tests and tests which can be adapted to arbitrary distributions. Here we 
distinguish tests applied to binned data and binning free tests. The latter 
are in principle preferable but so far they are almost exclusively limited to 
one dimensional distributions. A further distinction concerns the alternative 
hypothesis. Usually, it is not restricted but there exist also tests where it is 
parametrized. 

Physicists tend to be content with tests which are not necessarily optimum
in all cases. A very useful and comprehensive survey of goodness-of-fit
tests can be found in Ref. [1] from 1986. Since then, some new developments
have occurred and the increase in computing power has opened the
possibility to apply more elaborate tests.

In Section 2 we summarize the most important tests. To keep this article 
short we do not discuss tests based on the order statistic and spacing tests. 
In Section 3 we introduce two new tests, the three region test and the energy 
test. To compare the tests we apply them in Section 4 to some specific 
alternative hypotheses. We do not consider explicitly composite hypotheses. 



2 SOME RELEVANT TESTS 
2.1 Chi-squared test



The $\chi^2$ test is closely connected to least square fits with the
difference that the hypothesis is fixed. The test statistic is

$$\chi^2 = \sum_{i=1}^{B} \frac{(Y_i - t_i)^2}{\delta_i^2}$$

with the random variable $Y_i$, $t_i$ the expectation $E(Y_i)$ and
$\delta_i^2$ the expectation $E((Y_i - t_i)^2)$. Obviously the expectation
value of $\chi^2$ is

$$E(\chi^2) = B$$

In the Gaussian approximation $Y_i$ follows a Gaussian with mean $t_i$ and
variance $\delta_i^2$ and the test statistic follows a $\chi^2$ distribution
function $F_B(\chi^2)$ with $B$ degrees of freedom. The probability of an
error of the first kind $\alpha$ (significance level, p-value) defines
$\chi_0^2$ with $F_B(\chi_0^2) = 1 - \alpha$. The null hypothesis is
rejected if in an actual experiment we find $\chi^2 > \chi_0^2$.

We obtain the Pearson $\chi^2$ test when the random variables are Poisson
distributed.

$$\chi^2 = \sum_{i=1}^{B} \frac{(Y_i - t_i)^2}{t_i}$$

In the large sample limit the test statistic $\chi^2$ has approximately a
$\chi^2$ distribution with $B$ degrees of freedom. When we have a histogram,
where a total of $N$ events are distributed according to a multinomial
distribution among the $B$ bins with probabilities $p_i$ we get

$$\chi^2 = \sum_{i=1}^{B} \frac{(N_i - N p_i)^2}{N p_i}$$

which again follows asymptotically a $\chi^2$ distribution, this time with
$B - 1$ degrees of freedom. The reduced number of degrees of freedom is due
to the constraint $\sum_i N_i = N$.

Nowadays, the distribution function of the test statistic can be computed 
numerically without much effort. The $\chi^2$ test then can also be applied to
small samples. The Gaussian approximation is no longer required. 
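
The numerical computation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes the multinomial histogram case and estimates the distribution of the Pearson statistic by simulation. The function names `pearson_chi2` and `chi2_p_value_mc`, the number of replicas and the seed are hypothetical choices.

```python
import numpy as np

def pearson_chi2(counts, probs):
    """Pearson statistic sum_i (N_i - N p_i)^2 / (N p_i), one value per histogram."""
    counts = np.asarray(counts, dtype=float)
    expected = counts.sum(axis=-1, keepdims=True) * np.asarray(probs, dtype=float)
    return ((counts - expected) ** 2 / expected).sum(axis=-1)

def chi2_p_value_mc(counts, probs, n_rep=20000, seed=0):
    """Estimate P(chi2 >= observed | H0) from simulated multinomial histograms."""
    rng = np.random.default_rng(seed)
    n_events = int(np.sum(counts))
    observed = pearson_chi2(counts, probs)
    # Each row is one pseudo-experiment with the same total number of events.
    simulated = pearson_chi2(rng.multinomial(n_events, probs, size=n_rep), probs)
    return float(observed), float(np.mean(simulated >= observed))

# Example: 12 events in 4 equally probable bins, far from the asymptotic regime.
chi2_obs, p_value = chi2_p_value_mc([7, 2, 2, 1], [0.25, 0.25, 0.25, 0.25])
print(chi2_obs, p_value)
```

Because the reference distribution is simulated for the actual sample size, no assumption about the number of entries per bin is needed.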

The $\chi^2$ test is very simple and needs only limited computational power. A
big advantage compared to most of the other methods is that it can be applied
to multidimensional histograms. There are however also serious drawbacks:

• Its power in detecting slowly varying deviations of a histogram from
predictions is rather poor due to the neglect of possible correlations
between adjacent bins.

• Binning is required and the choice of the binning is arbitrary. 

• When the statistics is low or the number of dimensions is high, the 
event numbers per bin may be low. Then the asymptotic properties 
are no longer valid and systematic deviations are hidden by statistical 
fluctuations. 

There are proposals to fix the bin widths by the requirement of equal 
number of expected entries per bin. This is not necessarily the optimum 
choice 0. Often there are outliers in regions where no events are expected 
which would be hidden in wide bins. 

For the number of bins a dependence on the sample size $n$

$$B = 2 n^{2/5}$$

is proposed in Ref. 0]. Our experience is that in most experiments the num- 
ber of bins is chosen too high. The sensitivity to slowly varying deviations 
roughly goes with $B^{-1/2}$. In multidimensional cases the power of the test
often can be increased by applying it to the marginal distributions.

There is a whole class of $\chi^2$-like tests. Many studies can be found in the
literature. The reader is referred to Ref. 0. 

2.2 Binning-free empirical distribution function tests 

The tests described in this section have been taken from the article by
Stephens in Ref. [1].

Supposing that a random sample of size $n$ is given, we form the order
statistic $X_1 < X_2 < \dots < X_n$. We consider the empirical distribution
function (EDF)

$$F_n(x) = \frac{\#\ \text{of observations} \le x}{n}$$

or

$$F_n(x) = \begin{cases} 0 & x < X_1 \\ i/n & X_i \le x < X_{i+1} \\ 1 & X_n \le x \end{cases}$$

Figure 1: Comparison of empirical and theoretical distributions

Fn{x) is a step function which is to be compared to the distribution F{x) 
corresponding to Hq. 
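
As a small illustration of the step function defined above, the following sketch builds the EDF of a sample and evaluates it at arbitrary points. The helper name `edf` is hypothetical; `numpy.searchsorted` with `side="right"` counts the observations that are less than or equal to `x`.

```python
import numpy as np

def edf(sample):
    """Return F_n, the empirical distribution function of the sample."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = xs.size

    def F_n(x):
        # Number of observations <= x, divided by the sample size.
        return np.searchsorted(xs, x, side="right") / n

    return F_n

F_n = edf([0.8, 0.1, 0.4])
print(F_n(np.array([0.0, 0.1, 0.5, 1.0])))
```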

The EDF is consistent and unbiased. The tests discussed in this section 
are invariant under transformation of the random variable. Because of this 
feature, we can transform the distribution to the uniform distribution and 
restrict our discussion to the latter. 

2.2.1 Probability integral transformation 

The probability integral transformation (PIT)

$$Z = F(X)$$

transforms a general pdf $f(X)$ of $X$ into the uniform distribution
$f^*(Z)$ of $Z$.

$$f^*(Z) = 1, \quad 0 < Z < 1$$
$$F^*(Z) = Z$$

The underlying idea of this transformation is that the new distribution
function of $Z$, $F^*(Z)$, is extremely simple and that it conserves the
distribution of the test quantities discussed in this section. It is easily
seen that

$$F_n(x) - F(x) = F_n^*(z) - z$$



Note, however, that the PIT does not necessarily conserve all interesting
features of the gof problem. Resolution effects are washed out and, for
example, in a lifetime distribution an excess of events at small lifetimes
and one at large lifetimes may be judged differently but are treated
similarly after a PIT. It is not logical to select specific gof tests for
specific applications and then to transform all kinds of pdfs to the same
uniform distribution. The PIT is very useful because it permits
standardization but one has to be aware of its limitations.
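
To make the transformation concrete, here is a minimal sketch for the lifetime example, assuming an exponential pdf with mean lifetime `tau` under the null hypothesis. The function name `pit_exponential` and the numerical values are illustrative assumptions only.

```python
import numpy as np

def pit_exponential(t, tau):
    """Z = F(t) = 1 - exp(-t / tau) for an exponential lifetime distribution."""
    # expm1 keeps precision for lifetimes that are small compared to tau.
    return -np.expm1(-np.asarray(t, dtype=float) / tau)

rng = np.random.default_rng(1)
lifetimes = rng.exponential(scale=2.0, size=1000)  # sample that follows H0
z = pit_exponential(lifetimes, tau=2.0)            # approximately uniform on [0, 1]
print(z.min(), z.max(), z.mean())
```

Small lifetimes map to values near 0 and large lifetimes to values near 1, so after the transformation both kinds of excess appear as deviations at the borders of the unit interval.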

2.2.2 Supremum statistics 

The maximum positive (negative) deviation $D_+$ ($D_-$) of $F_n(x)$ from
$F(x)$ (see Fig. 1) is used as test statistic. Kolmogorov
(Kolmogorov-Smirnov test) has proposed to use the maximum absolute
difference. Kuiper uses the sum $V = D_+ + D_-$. This test statistic is
useful for observations "on the circle", for example for azimuthal
distributions where the zero angle is a matter of definition.

$$D_+ = \sup_x \{F_n(x) - F(x)\}$$
$$D_- = \sup_x \{F(x) - F_n(x)\}$$
$$D = \sup_x |F_n(x) - F(x)| \quad \text{(Kolmogorov)}$$
$$V = D_+ + D_- \quad \text{(Kuiper)}$$

The supremum statistics are invariant under the PIT.
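
The supremum statistics can be evaluated directly from the ordered, PIT-transformed sample, because the extreme deviations of a step function from a continuous one occur at the steps. The sketch below uses the standard computing formulas; the function name `supremum_statistics` is hypothetical and the input `z` is assumed to be already transformed to the unit interval.

```python
import numpy as np

def supremum_statistics(z):
    """D+, D-, Kolmogorov D and Kuiper V for PIT values z in [0, 1]."""
    z = np.sort(np.asarray(z, dtype=float))
    n = z.size
    i = np.arange(1, n + 1)
    d_plus = np.max(i / n - z)          # EDF just after a step minus F
    d_minus = np.max(z - (i - 1) / n)   # F minus EDF just before a step
    return {
        "D_plus": float(d_plus),
        "D_minus": float(d_minus),
        "D": float(max(d_plus, d_minus)),
        "V": float(d_plus + d_minus),
    }

print(supremum_statistics([0.05, 0.20, 0.25, 0.60, 0.90]))
```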

2.2.3 Integrated deviations - quadratic statistics 

The Cramer-von Mises family of tests measures the integrated quadratic
deviation of $F_n(x)$ from $F(x)$ suitably weighted by a weighting function
$\psi$:

$$Q = n \int_{-\infty}^{\infty} [F(x) - F_n(x)]^2 \psi(x) \, dF$$

In the standardized form we have

$$Q = n \int_0^1 [z - F_n^*(z)]^2 \psi(z) \, dz \qquad (1)$$

Since the construction of $F_n^*(z)$ includes already an integration,
$F_n^*(z_i)$ and $F_n^*(z_k)$ are not independent and the additional
integration in Equation (1) is not obvious.

With $\psi_{CvM} = 1$ we get the Cramer-von Mises statistic $W^2$ and
$\psi_{AD} = [z(1 - z)]^{-1}$ leads to the Anderson-Darling statistic $A^2$.

$$W^2 = n \int_0^1 [z - F_n^*(z)]^2 \, dz \quad \text{(Cramer-von Mises)}$$

$$A^2 = n \int_0^1 \frac{[z - F_n^*(z)]^2}{z(1 - z)} \, dz \quad \text{(Anderson-Darling)}$$

The Anderson-Darling statistic $A^2$ weights strongly deviations near
$z = 0$ and $z = 1$. This is justified because there the experimental
deviations are small due to the constraints $[z - F_n^*(z)] = 0$ at $z = 0$
and $z = 1$.

Watson has proposed a quadratic statistic on the circle:

$$U^2 = n \int_0^1 \left\{ F_n^*(z) - z - \int_0^1 [F_n^*(z') - z'] \, dz' \right\}^2 dz \quad \text{(Watson)}$$
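
For a step function the integrals above reduce to finite sums over the ordered sample. The sketch below uses the standard computing formulas for $W^2$, $A^2$ and $U^2$; the function name `quadratic_statistics` is hypothetical and the input is assumed to consist of PIT values strictly between 0 and 1 (the Anderson-Darling term diverges at the borders).

```python
import numpy as np

def quadratic_statistics(z):
    """Cramer-von Mises W^2, Anderson-Darling A^2 and Watson U^2 for PIT values z."""
    z = np.sort(np.asarray(z, dtype=float))
    n = z.size
    i = np.arange(1, n + 1)
    w2 = np.sum((z - (2 * i - 1) / (2 * n)) ** 2) + 1.0 / (12 * n)
    # z[::-1] pairs the i-th smallest value with the i-th largest one.
    a2 = -n - np.sum((2 * i - 1) * (np.log(z) + np.log1p(-z[::-1]))) / n
    u2 = w2 - n * (z.mean() - 0.5) ** 2
    return {"W2": float(w2), "A2": float(a2), "U2": float(u2)}

print(quadratic_statistics([0.05, 0.20, 0.25, 0.60, 0.90]))
```

For a single observation at $z = 0.5$ the sums can be checked by hand against the integrals: $W^2 = U^2 = 1/12$ and $A^2 = 2 \ln 2 - 1$.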

2.3 The Neyman statistic test

This test is different from all previously discussed tests. It parametrizes
the alternative hypothesis and applies the likelihood ratio test. The
alternative hypothesis corresponds to a pdf of the exponential family:

$$g_k(z) = C(\theta) \exp\left( \sum_{i=1}^{k} \theta_i \pi_i(z) \right)$$

The $g_k(z)$ are smooth alternatives to uniformity. The functions $\pi_i$
are Legendre polynomials of order $i$, $\theta_i$ are free parameters and
$C$ is a normalization function. The number $k$ of parameters is selected by
the user.
The likelihood ratio leads to the test statistic

$$N_k^2 = \frac{1}{n} \sum_{i=1}^{k} \left( \sum_{j=1}^{n} \pi_i(z_j) \right)^2$$

Asymptotically, for large samples, $N_k^2$ is distributed according to the
$\chi^2$ distribution with $k$ degrees of freedom.
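
A minimal sketch of the statistic follows. It assumes that the $\pi_i$ are the Legendre polynomials shifted to the interval [0, 1] and normalized to unit norm there, $\pi_i(z) = \sqrt{2i+1}\, P_i(2z - 1)$, which is the normalization required for the asymptotic $\chi^2$ behaviour; the text above does not state the normalization explicitly. The function name `neyman_statistic` is hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre

def neyman_statistic(z, k=2):
    """Neyman smooth statistic N_k^2 for PIT values z in [0, 1]."""
    z = np.asarray(z, dtype=float)
    n = z.size
    x = 2.0 * z - 1.0                    # map [0, 1] to [-1, 1]
    total = 0.0
    for i in range(1, k + 1):
        coeffs = np.zeros(i + 1)
        coeffs[i] = 1.0                  # selects the Legendre polynomial P_i
        pi_i = np.sqrt(2 * i + 1) * legendre.legval(x, coeffs)
        total += pi_i.sum() ** 2
    return float(total / n)

rng = np.random.default_rng(2)
print(neyman_statistic(rng.uniform(size=100), k=2))
```

The first component ($i = 1$) responds to a shift of the mean and the second to a change of the variance, which gives an intuition for the choice of $k$.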

3 NEW TESTS



3.1 Three region test 

Often experimental distributions are biased by an excess or lack of events
in certain regions of the random variable. We have designed a test which
subdivides the variable space into three pieces, containing $n_1$, $n_2$,
$n_3 = n - n_1 - n_2$ events, such that the deviation between data and
prediction from $H_0$ is maximum. The test quantity is

$$O = \sup \left\{ w_1 (n_1 - n p_1)^2 + w_2 (n_2 - n p_2)^2 + w_3 (n_3 - n p_3)^2 \right\}$$

where $n p_k$ are the expectation values and $w_k$ weights depending on
$n p_k$. The specific choice

$$w_k = \frac{1}{n p_k}$$

maximizes $\chi^2$ of the three bins. In the comparison below we have chosen
weights equal to one.

Of course the test can be generalized to a higher number of subregions. 
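
The three region test can be sketched as follows for unit weights and a sample that has already been transformed to the unit interval. This is an illustration under stated assumptions, not the authors' implementation: the supremum is searched by brute force over region boundaries placed at the data points, each taken once just below and once just above the point, plus the interval ends. For fixed event counts the quantity is a convex function of the boundaries, so its maximum lies at such boundary positions. The function name `three_region_statistic` is hypothetical.

```python
import numpy as np

def three_region_statistic(z):
    """Three region test quantity with unit weights for PIT values z in [0, 1]."""
    z = np.sort(np.asarray(z, dtype=float))
    n = z.size
    # Candidate cuts: every data point approached from below (point excluded)
    # and from above (point included), plus the interval ends 0 and 1.
    pos = np.concatenate(([0.0], np.repeat(z, 2), [1.0]))
    below = np.concatenate(
        ([0], np.repeat(np.arange(n), 2) + np.tile([0, 1], n), [n])
    )
    a, b = pos[:, None], pos[None, :]          # lower and upper boundary
    ca, cb = below[:, None], below[None, :]    # events below each boundary
    valid = (b >= a) & (cb >= ca)
    dev1 = ca - n * a                          # n_1 - n p_1
    dev2 = (cb - ca) - n * (b - a)             # n_2 - n p_2
    dev3 = (n - cb) - n * (1.0 - b)            # n_3 - n p_3
    stat = dev1 ** 2 + dev2 ** 2 + dev3 ** 2
    return float(stat[valid].max())

rng = np.random.default_rng(3)
print(three_region_statistic(rng.uniform(size=100)))
```

The search costs of the order of $n^2$ evaluations, which is unproblematic for samples of a few hundred events. The distribution of the test quantity under the null hypothesis would be obtained by repeating the computation for simulated uniform samples.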

3.2 Minimum energy test 
3.2.1 The idea 

Let us assume that we have a continuous charge distribution $\rho(r)$ of
positive electric charges and a sample of negative point charges with total
charge equal to minus the integrated positive charge. The potential energy
is minimum when the negative point charges follow $\rho$. Then, up to
effects due to the discrete nature of the point charges, the charge density
is zero everywhere. Any displacement of charges would increase the energy.
We use this property to construct a binning free test procedure.

We simulate the theoretical distribution by $m$ charges of charge $1/m$
each. Usually, these charges are distributed using a Monte Carlo simulation.
To the $n$ experimental sample points (data points) we associate charges
$-1/n$. The test quantity $\phi$ corresponds to the potential energy. It
contains two terms $\phi_1$, $\phi_2$ corresponding to the interaction of
the experimental charges with each other and to the interaction of the
experimental charges with the positive simulation charges.



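$$\phi_1 = \frac{1}{n^2} \sum_{i<j} R(d_{ij}) \qquad (2)$$

$$\phi_2 = -\frac{1}{nm} \sum_{i,j} R(t_{ij}) \qquad (3)$$

$$\phi = \phi_1 + \phi_2 \qquad (4)$$

Here $d_{ij}$ is the distance between two data points, $t_{ij}$ is the
distance between a data point and a simulation point and $R$ is a
correlation function defined below. The sums run over all combinations.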

Remark: The minimum energy requirement for the equality of experi- 
mental and theoretical distribution is strictly correct only when the number 
m of simulation charges is equal to the number n of experimental charges. 
For the general case with a continuous theoretical distribution or simulation 
sample and experimental sample of different size, the optimum agreement 
of the two distributions is not well defined and there is a slight dependence 
of the minimum energy configuration on the correlation function. This is,
however, a purely academic problem: the test statistic remains a powerful
indicator for an incompatibility of the experimental sample with $H_0$.

3.2.2 The correlation function 

We note that the minimum energy configuration does not depend on the 
application of the one-over-distance power law of electrostatics. We may 
apply a wide class of correlation functions R{r) with the only requirement 
that R has to decrease monotonically with the distance r. 

We have investigated three different types of correlation functions, power
laws, a logarithmic dependence and Gaussians.

$$R(r) = \frac{1}{r^{\kappa}} \qquad (5)$$

$$R(r) = -\ln r \qquad (6)$$

$$R(r) = e^{-r^2/(2 s^2)} \qquad (7)$$

The first type is motivated by the analogy to electrostatics, the second
is long range and the third emphasizes a limited range for the correlation
between different points. The power $\kappa$ of the denominator in Equ. (5)
and the parameter $s$ in Equ. (7) may be chosen differently for different
dimensions of the sample space and different applications. For long range
distortions a small value of $\kappa$ around 0.1 is recommended. For short
range deviations the test quantity with larger values around 0.3 is more
sensitive.

The inverse power law and the logarithm have a singularity at $r$ equal
to zero. Very small distances, however, should not be weighted too strongly
since distortions with sharp peaks are not expected and usually inhibited by
the finite experimental resolution. We eliminate the singularity by
introducing a lower cutoff $d_{min}$ for the distances $d$ and $t$.
Distances less than $d_{min}$ are replaced by $d_{min}$. The value of this
parameter is not critical, it should be of the order of the average distance
$d$ in the regions where $f_0$ is maximum and not less than the experimental
resolution.

The energy test with Gaussian correlation function is closely related to the
Pearson $\chi^2$ test. A more detailed description of the energy test is presented
inRef. [|. 
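
The energy test can be sketched in a few lines. The example below is an illustration under assumptions, not the authors' code: it uses the logarithmic correlation function with a lower distance cutoff, Euclidean distances, and obtains the distribution of the test quantity under the null hypothesis from pseudo-experiments. The names `energy_statistic`, `energy_p_value`, the callable `sampler(rng, size)` and the values of `m`, `n_rep` and `d_min` are hypothetical.

```python
import numpy as np

def log_correlation(r, d_min):
    """R(r) = -ln r with distances below d_min replaced by d_min."""
    return -np.log(np.maximum(r, d_min))

def pairwise_distances(a, b):
    """Euclidean distances between all rows of a and all rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def energy_statistic(data, sim, d_min):
    """phi = phi_1 + phi_2 for data (n, dim) and simulation sample (m, dim)."""
    data = np.asarray(data, dtype=float).reshape(len(data), -1)
    sim = np.asarray(sim, dtype=float).reshape(len(sim), -1)
    n, m = len(data), len(sim)
    upper = np.triu_indices(n, k=1)            # pairs with i < j
    d = pairwise_distances(data, data)[upper]
    t = pairwise_distances(data, sim)
    phi_1 = log_correlation(d, d_min).sum() / n ** 2
    phi_2 = -log_correlation(t, d_min).sum() / (n * m)
    return float(phi_1 + phi_2)

def energy_p_value(data, sampler, m=500, n_rep=200, d_min=0.05, seed=0):
    """Fraction of H0 pseudo-experiments with an energy >= the observed one."""
    rng = np.random.default_rng(seed)
    sim = sampler(rng, m)                      # fixed simulation sample
    observed = energy_statistic(data, sim, d_min)
    null = np.array([
        energy_statistic(sampler(rng, len(data)), sim, d_min)
        for _ in range(n_rep)
    ])
    return observed, float(np.mean(null >= observed))

# Example: two-dimensional standard normal H0, data shifted away from it.
def normal_2d(rng, size):
    return rng.normal(size=(size, 2))

shifted = normal_2d(np.random.default_rng(4), 50) + 3.0
print(energy_p_value(shifted, normal_2d))
```

The same simulation sample is used for the data and for all pseudo-experiments, so the comparison is consistent even though the absolute value of $\phi$ depends on that sample.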

3.3 Comparison of uni-variate tests 

We have tested the null hypothesis of a uniform distribution in the interval
[0, 1] using a uniform distribution contaminated by the background
distributions displayed in Figure 2.

Background hypothesis A modifies the mean, hypotheses B, C change the
variance of $H_0$.

The power of various tests described above is presented in Figure 3.

As expected, none of the tests is optimum for all kinds of distortions.
Several tests perform better than the $\chi^2$ test. The Neyman test, the
Anderson-Darling test and the Kolmogorov-Smirnov test are sensitive to a
shift in the mean. The Anderson-Darling test detects especially deviations
at the borders of the interval. Watson's and Kuiper's tests are useful for
the detection of distortions of the variance. The two new tests compare
favorably with the standard ones.



4 MULTIVARIATE TESTS 

The Mardia test and the Neyman smooth test can be used to investigate
two-dimensional Gaussian distributions. The only distribution free test
known to us which is independent of the dimensions of the variate space
is the $\chi^2$ test. A generalized Kolmogorov-Smirnov test depends on the
ordering scheme of the variates. The binning free energy test developed by
us is also independent of the number of variates, however the distribution
of the test statistic has to be computed for the specific sample
distribution under study.

Figure 2: Different types of background distributions (panels: alternative
A, k=3; alternative B, k=2; alternative C, k=2)

Figure 3: Histograms show the powers of different tests at 5% level for the
sample size n=100.

Figure 4: Different types of background distributions in two dimensions

4.1 Comparison of multivariate tests 

We have used a two-dimensional Gaussian null hypothesis and contaminated
the sample with the background distributions shown in Figure 4.

The power of the two Mardia tests, the Neyman smooth test and the
energy test with logarithmic and Gaussian correlation function is presented
in Figure 5.

In most cases the two energy tests perform better than the alternatives 
even though those have been designed for a Gaussian null hypothesis. 

4.2 Example: Comparing experimental data to a Monte 
Carlo prediction 

In Figure 6, left hand side, we compare the position and momentum of a few
$J/\psi$ decay tracks to a Monte Carlo simulation. The right hand plot
compares the energy computed from the distribution on the left hand side to
a Monte Carlo simulation of the null hypothesis. The experimental point,
indicated by the arrow, is larger than all Monte Carlo values. Apparently,
the data do not follow the prediction.

Figure 5: Power comparisons of tests of bivariate normality for the sample
size n=200 at 5% significance level.

Figure 6: Comparison of experimental distribution (squares) with Monte
Carlo simulation (dots). The experimental energy computed from the scatter
plot (left) is compared to a Monte Carlo simulation of the experiment (right).

5 CONCLUSIONS



The $\chi^2$ test suffers from the requirement to choose a binning. In one
dimension it should be replaced by the well established binning free tests
like the Kolmogorov test. The choice of a specific test has to depend on the
expected kind of possible distortion of the theoretical distribution. For a
localized background we advise to use the new three region test. For
multivariate applications the new energy test is an attractive alternative
to the $\chi^2$ test.